Entity Level Data Integration by Statistical Methods

Author

  • Hans-Joachim Lenz
Abstract

In most cases, unique identifiers are required to join data from different databases. If global unique keys are absent or corrupted, combining data extracted from different sources becomes difficult. The main question is: Is a given record related to an entity that is identical to the entity corresponding to another record, or not? This leads to a classification problem with at least two classes: identical and not identical. Classifying pairs of records requires a three-step procedure. The first step is to define suitable common properties (attributes) of the data for all the different sources. Second, to allow comparisons, the values of the records are transformed to these common properties. Finally, the classification is performed on an almost finite subset, the range of an appropriate comparison function. Various classification techniques can be applied, such as association rules, classification trees, neural networks, or record linkage techniques. The unknown parameters of the classification rules are computed by sampling and supervised learning. Unbiased error rates can be estimated, for instance, by cross-validation. Special attention must be paid to controlling the computational complexity of the identification process. The approach is illustrated with data from two library databases and from the planned German administrative record census, which will become a substitute for a regular census.
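
The three-step procedure described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the author's method: the attribute names, the toy records, the similarity threshold, and the helper functions (`normalize`, `compare`, `fit_weights`, `classify`) are all assumptions introduced here. The classifier is a simple log-likelihood-ratio score in the spirit of classical record linkage, with its parameters estimated from a small labeled sample (supervised learning).

```python
import math
from difflib import SequenceMatcher

# Step 1 (assumed): common attributes available in both sources.
COMMON_ATTRS = ("title", "author", "year")

def normalize(record):
    """Step 2: transform a raw record to the common attributes."""
    return {a: str(record.get(a, "")).strip().lower() for a in COMMON_ATTRS}

def compare(rec_a, rec_b, threshold=0.85):
    """Comparison function with a finite range: one 0/1 agreement flag per attribute."""
    a, b = normalize(rec_a), normalize(rec_b)
    return tuple(
        int(bool(a[attr]) and SequenceMatcher(None, a[attr], b[attr]).ratio() >= threshold)
        for attr in COMMON_ATTRS
    )

def fit_weights(labeled_pairs):
    """Estimate per-attribute agreement/disagreement weights from labeled pairs."""
    k = len(COMMON_ATTRS)
    agree = {True: [0] * k, False: [0] * k}
    total = {True: 0, False: 0}
    for rec_a, rec_b, is_match in labeled_pairs:
        total[is_match] += 1
        for i, flag in enumerate(compare(rec_a, rec_b)):
            agree[is_match][i] += flag
    weights = []
    for i in range(k):
        # Laplace smoothing keeps both probabilities strictly between 0 and 1.
        m = (agree[True][i] + 1) / (total[True] + 2)    # P(agree | identical)
        u = (agree[False][i] + 1) / (total[False] + 2)  # P(agree | not identical)
        weights.append((math.log(m / u), math.log((1 - m) / (1 - u))))
    return weights

def classify(rec_a, rec_b, weights, cutoff=0.0):
    """Step 3: classify a record pair from its comparison vector."""
    gamma = compare(rec_a, rec_b)
    score = sum(w_agree if g else w_dis for g, (w_agree, w_dis) in zip(gamma, weights))
    return "identical" if score > cutoff else "not identical"

# Toy labeled sample (made-up records, for illustration only).
sample = [
    ({"title": "Data Integration", "author": "J. Doe", "year": 2003},
     {"title": "data integration ", "author": "J. Doe", "year": "2003"}, True),
    ({"title": "Census Methods", "author": "M. Smith", "year": 2001},
     {"title": "Census Methods", "author": "M. Smith", "year": 2001}, True),
    ({"title": "Trees", "author": "J. Doe", "year": 2003},
     {"title": "Neural Networks for Linkage", "author": "M. Smith", "year": 2003}, False),
    ({"title": "Trees", "author": "J. Doe", "year": 1999},
     {"title": "Neural Networks for Linkage", "author": "M. Smith", "year": 2003}, False),
]
weights = fit_weights(sample)

query = {"title": "Entity Level Data Integration", "author": "Lenz", "year": 2003}
typo = {"title": "Entity level data integraton", "author": "lenz", "year": "2003"}
other = {"title": "Cadastral Mapping", "author": "Smith", "year": 2001}
print(classify(query, typo, weights))
print(classify(query, other, weights))
```

Because `compare` only returns 0/1 flags, its range has at most 2^k elements for k attributes, which is what makes a table-like classification rule feasible. In practice, the error rate of such a rule would be estimated on held-out pairs (for example, by cross-validation), and candidate pairs would be restricted in some way to keep the number of comparisons manageable.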

Related resources

People Summarization by Combining Named Entity Recognition and Relation Extraction

The two most important tasks in entity information summarization from the Web are named entity recognition and relation extraction. Little work has been done toward an integrated statistical model for understanding both named entities and their relationships. Most of the previous works on relation extraction assume the named entities are pre-given. The drawbacks of these sequential models are t...

Incorporating Knowledge of Source Language Text in a System for Dictation of Document Translations

This paper describes methods for integrating source language and target language information for machine aided human translation (MAHT) of text documents. These methods are applied to a language translation task involving a human translator dictating a first draft translation of a source language document. A method is presented which integrates target language automatic speech recognition (ASR)...

An ontology based approach to the integration of entity-relationship schemas

In schema integration, schematic discrepancies occur when data in one database correspond to metadata in another. We explicitly declare the context that is the meta information relating to the source, classification, property etc of entities, relationships or attribute values in entity-relationship (ER) schemas. We present algorithms to resolve schematic discrepancies by transforming metadata i...

Integration of Remote Sensing and the GIS-based Methods for Provision of Cadastral Mapping of Agricultural Areas of Ardakan City

In the fifth development plan, the establishment of a nationwide Cadastre System for agriculture has been defined as a work priority of the institutions and organizations responsible for agriculture and equity issuance in the country. In this study, the possibility of provision of the Cadastral mapping of agriculture by an integration of the data of the remote sensing and ...

The Role of Students’ Social and Academic Integration in Their Evaluation of Faculties’ Educational Performance Quality in Shiraz University of Medical Sciences

Introduction: The purpose of this study was to explore the relationship between students’ social and academic integration and their evaluation of the faculties’ educational performance quality in Shiraz University of Medical Sciences. Methods: This descriptive-correlational study was performed on all students of Shiraz University of Medical Sciences. The participants (n = 431) were selected thr...

Publication date: 2003